Abstract



Two standard algorithms for approximately solving two-player zero-sum concurrent reachability
games are value iteration and strategy iteration. We prove upper and lower bounds of $2^{m^{\Theta(N)}}$ on the worst
case number of iterations needed by both of these algorithms for providing non-trivial approximations
to the value of a game with N non-terminal positions and m actions for each player in each position.
In particular, both algorithms have doubly-exponential complexity. Even when the game given as input
has only one non-terminal position, we prove an exponential lower bound on the worst case number of
iterations needed to provide non-trivial approximations.

1 Introduction 

1.1 Statement of problem and overview of results



We consider finite state, two-player, zero-sum, deterministic, concurrent reachability games. For brevity,
we shall henceforth refer to these as just reachability games. The class of reachability games is a subclass
of the class of games dubbed recursive games by Everett [8] and was introduced to the computer science
community in a seminal paper by de Alfaro, Henzinger and Kupferman [1]. A reachability game G is played
between two players, Player I and Player II. The game has a finite set of non-terminal positions and special
terminal positions GOAL and TRAP.^1 In this paper, we let N denote the number of non-terminal positions
and assume positions are indexed 1, ..., N, while GOAL is indexed N + 1 and TRAP 0. At any point in
time during play, a pebble rests at some position. The position holding the pebble is called the current
position. The objective for Player I is to eventually make the current position GOAL. If this happens, play
ends and Player I wins. The objective for Player II is to forever prevent this from happening. This may
be accomplished either by the pebble reaching TRAP, from where it cannot escape, or by it moving between
non-terminal positions indefinitely. To each non-terminal position i is associated a finite set of actions $A_i^1$, $A_i^2$



*Work supported by Center for Algorithmic Game Theory, funded by the Carlsberg Foundation. The authors acknowledge 
support from The Danish National Research Foundation and The National Science Foundation of China (under the grant 
^1 Including the TRAP position in the setup is actually not strictly needed, as one could replace it with any non-terminal
position from which no escape is possible, but including it is quite convenient and fairly standard. In particular, including it 
makes "a reachability game with one non-terminal position" mean what we think it should. 



for each of the two players. In this paper, we assume that all these sets have the same size m (if not, we
may "copy" actions to make this so) and that $A_i^1 = A_i^2 = \{1, \ldots, m\}$. At each point in time, if the current
position is i, Player I and Player II simultaneously choose actions in {1, ..., m}. To each position i and
each action pair $(a, a') \in \{1, \ldots, m\}^2$ is associated a position $\pi(i, a, a')$. In other words, each position holds
an m x m matrix of pointers to positions. When the current position at time t is i and the players play the
action pair (a, a'), the new position of the pebble at time t + 1 is $\pi(i, a, a')$.

A strategy for a reachability game is a (possibly randomized) procedure for selecting which action to
take, given the history of the play so far. A strategy profile is a pair of strategies, one for each player. A
stationary strategy is the special case of a strategy where the choice only depends on the current position.
Such a strategy is given by a family of probability distributions on actions, one distribution for each position,
with the probability of an action according to such a distribution being called a behavior probability. We
let $\mu_i(x, y)$ denote the probability that Player I eventually reaches GOAL if the players play using the
strategy profile (x, y) and the pebble starts in position i. The lower value of position i is defined as
$\underline{v}_i = \sup_{x \in S^1} \inf_{y \in S^2} \mu_i(x, y)$, where $S^1$ ($S^2$) is the set of strategies for Player I (Player II). Similarly, the
upper value of a position i is $\overline{v}_i = \inf_{y \in S^2} \sup_{x \in S^1} \mu_i(x, y)$. Everett [8] showed that for all positions i in a
reachability game, the lower value $\underline{v}_i$ in fact equals the upper value $\overline{v}_i$, and this number is therefore simply
called the value $v_i$ of that position. The vector v is called the value vector of the game. Furthermore, Everett
showed that for any ε > 0, there is a stationary strategy $x^*$ of Player I so that for all positions i, we have
$\inf_{y \in S^2} \mu_i(x^*, y) \ge v_i - \epsilon$, i.e., the strategy $x^*$ guarantees the value of any position within ε when play starts
in that position. Such a strategy is called ε-optimal. Note that $x^*$ does not depend on i. It may however
depend on ε > 0 and this dependence may be necessary, as shown by examples of Everett. In contrast, it
is known that Player II has an exact optimal strategy that is guaranteed to achieve the value of the game,
without any additive error [17, 13].

In this paper, we consider algorithms for solving reachability games. There are two notions of solving a 
reachability game relevant for this paper: 

1. Quantitatively: Given a game, compute ε-approximations of the entries of its value vector (we
consider approximations, rather than exact computations, as the value of a reachability game may be an
irrational number).

2. Strategically: Given a game, compute an ε-optimal strategy for Player I.

Once a game has been solved strategically, it is straightforward to also solve it quantitatively (for the same ε)
by analyzing, using linear programming, the finite state Markov decision process for Player II resulting when
freezing the computed strategy for Player I. The converse direction is far from obvious, and it was in fact
shown by Hansen, Koucky and Miltersen [12] that if standard binary representation of behavior probabilities
is used, merely exhibiting a (1/4)-optimal strategy requires worst case exponential space in the size of the
game. In contrast, a (1/4)-approximation to the value vector obviously only requires polynomial space to
describe and it may be possible to compute it in polynomial time, though it is currently not known how to
do so [6].

There is a large and growing literature on solving reachability games [1, 11, 4, 2, 3, 12]. In this paper,
we focus on the two perhaps best-known and best-studied algorithms, value iteration and strategy iteration.
Both were originally derived from similar algorithms for solving Markov decision processes [15] and discounted
stochastic games [18]. We describe these algorithms next. Value iteration is Algorithm 1. It
approximately solves reachability games quantitatively.



Algorithm 1: Value Iteration

t := 0 ;
v^0 := (0, ..., 0, 1) ;   // the vector v^0 is indexed 0, 1, ..., N, N + 1
while true do
    t := t + 1 ;
    v^t_0 := 0 ;
    v^t_{N+1} := 1 ;
    for i ∈ {1, 2, ..., N} do
        v^t_i := val(A_i(v^{t-1})) ;

Algorithm 2: Strategy Iteration

t := 1 ;
x^1 := the strategy for Player I playing uniformly at each position ;
while true do
    y^t := an optimal best reply by Player II to x^t ;
    for i ∈ {0, 1, 2, ..., N, N + 1} do
        v^t_i := μ_i(x^t, y^t) ;
    t := t + 1 ;
    for i ∈ {1, 2, ..., N} do
        if val(A_i(v^{t-1})) > v^{t-1}_i then
            x^t_i := maximin(A_i(v^{t-1})) ;
        else
            x^t_i := x^{t-1}_i ;

In the pseudocode of Algorithm 1, the matrix $A_i(v^{t-1})$ denotes the result of replacing each pointer to
a position j in the m x m matrix of pointers at position i with the real number $v_j^{t-1}$. That is, $A_i(v^{t-1})$
is a matrix of m x m real numbers. Also, $\mathrm{val}(A_i(v^{t-1}))$ denotes the value of the matrix game with matrix
$A_i(v^{t-1})$ and the row player being the maximizer. This value may be found using linear programming.
Value iteration works by iteratively updating a valuation of the positions, i.e., the numbers $v_i^t$. Clearly, when
implementing the algorithm, valuations $v^t$ only have to be kept for one iteration of the while loop after the
iteration in which they are computed and the algorithm thus only needs to store O(N) real numbers.^2 As
stated, the algorithm is non-terminating, but has the property that as t approaches infinity, the valuations $v_i^t$
approach the correct values $v_i$ from below. We present an easy (though not self-contained) proof of this
well-known fact in Section 2.1 below, and also explain the intuition behind the truth of this statement. However,
until the present paper, there has been no published information on the number of iterations needed for the
approximation to be an ε-approximation to the correct value for the general case of concurrent reachability
games, though Condon [5] observed that for the case of turn-based games (or "simple stochastic games"),
the number of iterations has to be at least exponential in N in order to achieve an ε-approximation. Clearly,
the concurrent case is at least as bad. In fact, this paper will show that the concurrent case is much
worse.
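As a concrete illustration of the update rule of Algorithm 1, the following Python sketch runs value iteration on games with m = 2 actions per player, where the value of each 2 x 2 matrix game has a closed form, so no linear programming solver is needed. The dictionary layout `pi[i][a][b]`, the function names, the example game, and the restriction to m = 2 are assumptions made for this illustration only; in general each m x m matrix game is solved by linear programming.

```python
from fractions import Fraction


def val_2x2(A):
    """Value of a 2 x 2 matrix game with the row player maximizing."""
    (a, b), (c, d) = A
    lower = max(min(a, b), min(c, d))  # best pure guarantee of the row player
    upper = min(max(a, c), max(b, d))  # best pure guarantee of the column player
    if lower == upper:                 # pure saddle point
        return lower
    # No pure saddle point: both players mix and the classical formula applies.
    return (a * d - b * c) / (a + d - b - c)


def value_iteration(pi, N, rounds):
    """Run `rounds` iterations of the while loop; pi[i][a][b] is the pointer at
    position i for actions a (Player I) and b (Player II), with a, b in {1, 2}."""
    v = [Fraction(0)] * (N + 1) + [Fraction(1)]  # v[0] = TRAP, v[N + 1] = GOAL
    for _ in range(rounds):
        prev = v
        v = list(prev)                           # terminal entries stay 0 and 1
        for i in range(1, N + 1):
            A = [[prev[pi[i][a][b]] for b in (1, 2)] for a in (1, 2)]
            v[i] = val_2x2(A)
    return v


# Hypothetical one-position example: equal actions (a == b) lead to GOAL (2),
# a > b leads to TRAP (0), and a < b leads back to position 1.
one_position = {1: {1: {1: 2, 2: 1}, 2: {1: 0, 2: 2}}}
print(value_iteration(one_position, 1, 5)[1])  # 5/6
```

On this example the update is v -> 1/(2 - v), so the valuation after t iterations is t/(t + 1), which approaches the value 1 from below.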

Strategy iteration is Algorithm 2. It approximately solves reachability games quantitatively as well as
strategically. In the pseudocode of Algorithm 2, the line "$y^t$ := an optimal best reply by Player II to $x^t$" should be
interpreted as follows: When Player I's strategy has been "frozen" to $x^t$, the resulting game is a one-player
game for Player II, also known as an absorbing Markov decision process. For such a process, an optimal
stationary strategy $y^t$ that is pure is known to exist, and can be found in polynomial time using linear
programming [15]. The expression $\mathrm{maximin}(A_i(v^{t-1}))$ denotes a maximin mixed strategy (an "optimal



^2 In this paper, we assume the real number model of computation and ignore the (severe) technical issues arising when
implementing the algorithm using finite-precision arithmetic. 



strategy") for the maximizing row player in the matrix game $A_i(v^{t-1})$. This optimal strategy may again be
found using linear programming. The strategy iteration algorithm was originally described for one-player
games by Howard [15], with Player I being the single player; in that case, in the pseudocode, the line "$y^t$ :=
an optimal best reply by Player II to $x^t$" is simply omitted. Subsequently, a variant of the pseudocode of Algorithm 2
was shown by Hoffman and Karp [14] to be a correct approximation algorithm for the class of recurrent
undiscounted stochastic games and by Rao, Chandrasekaran and Nair [18] to be a correct algorithm for the
class of discounted stochastic games. Finally, Chatterjee, de Alfaro and Henzinger [2] showed the pseudocode
of Algorithm 2 to be a correct approximation algorithm for the class of reachability games. As is the case for
value iteration, the strategy iteration algorithm is non-terminating, but has the property that as t approaches
infinity, the valuations $v_i^t$ approach the correct values $v_i$ from below. Chatterjee et al. [2, Lemma 8] prove
this by relating the algorithm to the value iteration algorithm. In particular, writing $\tilde{v}_i^t$ for the valuation
computed by value iteration and $v_i^t$ for the valuation computed by strategy iteration, they prove:

$$\tilde{v}_i^t \le v_i^t \le v_i. \qquad (1)$$

That is, strategy iteration needs at most as many iterations of the while loop as value iteration to achieve a
particular degree of approximation to the correct values $v_i$. Also, the strategies $x^t$ guarantee the valuations
$v^t$ for Player I, so whenever these valuations are ε-close to the values, the corresponding $x^t$ is an ε-optimal
strategy. However, until the present paper, there has been no published information on the number of
iterations needed for the approximation to be an ε-optimal solution, though a recent breakthrough result of
Friedman [9] proved that for the case of turn-based games, the number of iterations is at least exponential
in N in the worst case. Clearly, the concurrent case is at least as bad. In fact, this paper will show that the
concurrent case is much worse!
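To make the loop of Algorithm 2 concrete, here is a minimal Python sketch for a single hypothetical game: one non-terminal position and m = 2 actions, where equal actions lead to GOAL, a larger action by Player I leads to TRAP, and a smaller one repeats the position. For this game both the best reply of Player II and the maximin strategy have simple closed forms; these closed forms and all function names are assumptions specific to the example, whereas the general algorithm uses linear programming for both steps.

```python
from fractions import Fraction


def best_reply_valuation(x):
    """Probability that Player I reaches GOAL with the stationary strategy
    x = (x1, x2) against a best pure stationary reply of Player II."""
    x1, x2 = x
    # Player II always picks 1: play ends at once and Player I wins iff he picks 1.
    against_1 = x1
    # Player II always picks 2: picking 1 repeats the position, picking 2 wins.
    against_2 = Fraction(1) if x2 > 0 else Fraction(0)
    return min(against_1, against_2)


def matrix_value(v):
    """Value of the matrix game [[1, v], [0, 1]] for 0 <= v < 1."""
    return 1 / (2 - v)


def maximin(v):
    """Maximin row strategy of the matrix game [[1, v], [0, 1]] for 0 <= v < 1."""
    return (1 / (2 - v), (1 - v) / (2 - v))


def strategy_iteration(rounds):
    """Return the valuations of the single position for the first `rounds` iterations."""
    x = (Fraction(1, 2), Fraction(1, 2))  # uniform initial strategy
    valuations = []
    for _ in range(rounds):
        v = best_reply_valuation(x)       # valuation guaranteed by the current strategy
        valuations.append(v)
        if matrix_value(v) > v:           # improvement possible: switch strategy
            x = maximin(v)
    return valuations


print(strategy_iteration(4))
```

In this example the valuations are 1/2, 2/3, 3/4, 4/5, and the smallest behavior probability of the strategy in iteration t is 1/(t + 1).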

As our main result, we exhibit a family of reachability games with N positions and m actions for each
player in each position, such that all non-terminal positions have value one and such that value iteration as
well as strategy iteration need at least a doubly exponential $2^{m^{\Omega(N)}}$ number of iterations to obtain valuations
larger than any fixed constant (say 0.01). By inequality (1), it is enough to consider the strategy iteration
algorithm to establish this. However, our proof is much easier and cleaner for the value iteration algorithm,
the exact bounds are somewhat better, and our much more technical proof for the strategy iteration case is
in fact based upon it. So, we shall present separate proofs for the two cases.

Our hard instances P(N, m) for both algorithms are generalizations of the "Purgatory" games defined by
Hansen, Miltersen and Koucky [12] (these occur as special cases by setting m = 2). Following the conventions
of that paper, we describe these games as being games between Dante (Player I) and Lucifer (Player II). The
game P(N, m) can be described succinctly as follows: Lucifer repeatedly selects and hides a number between
1 and m. Each time Lucifer hides such a number, Dante must try to guess which number it is. After the
guess, the hidden number is revealed. If Dante ever guesses a number which is strictly higher than the one
Lucifer is hiding, Dante loses the game. If Dante ever guesses correctly N times in a row, the game ends
with Dante being the winner. If neither of these two events ever happens and the play thus continues forever,
Dante loses. It is easy to see that P(N, m) can be described as a deterministic concurrent reachability
game with N non-terminal positions and m actions for each player in each position. Also, by applying a
polynomial-time algorithm by de Alfaro et al. [1] for determining which positions in a reachability game
have value 1, we find that all positions except TRAP have value 1 in P(N, m). That is, Dante can win this
game with arbitrarily high probability.
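The verbal description of P(N, m) translates directly into the m x m pointer matrices of a reachability game. The Python sketch below builds them under the indexing used in this paper (TRAP is 0, GOAL is N + 1, and position i means that Dante has guessed correctly i - 1 times in a row); the function name and the nested-dictionary layout are assumptions made for illustration.

```python
def purgatory(N, m):
    """Pointer matrices of P(N, m): pi[i][a][b] is the next position when Dante
    guesses a and Lucifer hides b at position i (all indices start at 1)."""
    TRAP, GOAL = 0, N + 1
    pi = {}
    for i in range(1, N + 1):
        pi[i] = {}
        for a in range(1, m + 1):
            pi[i][a] = {}
            for b in range(1, m + 1):
                if a > b:
                    pi[i][a][b] = TRAP                       # guess too high: Dante loses
                elif a == b:
                    pi[i][a][b] = GOAL if i == N else i + 1  # one more correct guess
                else:
                    pi[i][a][b] = 1                          # guess too low: streak restarts
    return pi


pi = purgatory(3, 2)
print(pi[3][2][2], pi[2][1][2], pi[1][2][1])  # 4 1 0
```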

We note that these hard instances are very natural and easy to describe as games that one might even
conceivably have a bit of fun playing (the reader is invited to try playing P(3, 2) or P(1, 5) with an uninitiated
party)! In this respect, they are quite different from the recent extremely ingenious turn-based games due
to Friedman [9] where strategy iteration exhibits exponential behavior.

Using recent improved upper bounds on the patience of ε-optimal strategies for Everett's recursive games,
we provide matching $2^{m^{O(N)}}$ upper bounds on the number of iterations sufficient for getting adequate
approximate values, by each of the algorithms. In particular, both algorithms are also of at most doubly-exponential
complexity.

That the doubly-exponential complexity is a real phenomenon is illustrated in Table 1, which tabulates
the valuations computed by strategy iteration for the initial position of P(7, 2), i.e., "Dante's Purgatory"
[12], a 7-position game of value 1. The algorithm was implemented using double precision floating point
arithmetic and was allowed to run for one hundred million iterations, at which point the arithmetic precision
was inadequate for representing the computed strategies (note that the main result of Hansen, Miltersen and
Koucky [12] implies that roughly 64 decimal digits of precision is needed to describe a strategy achieving a
valuation above 0.9).

Valuation
0.013
0.035
0.102
0.134
0.248

Table 1: Running Strategy Iteration on P(7, 2).

Interestingly, when introduced as an algorithm for solving concurrent reachability games [2], strategy 
iteration was proposed as a practical alternative to generic algorithms having an exponential worst case 
complexity. More precisely, one obtains a generic algorithm for solving reachability games quantitatively by 
reducing the problem to the decision problem for the existential fragment of the first order theory of the real 
numbers [6]. This yields an exponential time (in fact a PSPACE) algorithm. Our results show that this
generic algorithm is in fact astronomically more practical than strategy iteration on very simple and natural 
instances. Still, it is not practical in any real sense of this term, even given state-of-the-art implementations 
of the best known decision procedures for the theory of the reals. Finding a practical algorithm remains a 
very interesting open problem. 

1.2 Overview of proof techniques 

Our proof of the lower bound for the case of value iteration is very intuitive. It is based on combining the 
following facts: 

1. The valuations $v_i^t$ obtained in iteration t of value iteration are in fact the values of a time bounded version
of the reachability game, where Player I loses if he has not reached GOAL at time t.

2. While the value of the game P(N, m) is 1, the value of its time bounded version is very close to 0 for
all small values of t.

The second fact was established by Hansen et al. [12] for the case m = 2 by relating the so-called patience of 
reachability games to the values of their time bounded version, without the connection to the value iteration 
algorithm being made explicit, by giving bounds on the patience of the games P(N, 2). The present paper
provides a different and arguably simpler proof of the lower bound on the value of the time bounded game 
that gives bounds also for other values of m than 2. It is based on exhibiting a fixed strategy for Lucifer 
that prevents Dante from winning fast. 

The lower bound for strategy iteration is much more technical. We remark that the analysis of value
iteration is used twice and in two different ways in the proof. It proceeds roughly as follows: The analysis
of value iteration yields that when value iteration is applied to P(1, m), exponentially many iterations (in
m) are needed to yield a close approximation of the value. We can also show that when strategy iteration is
applied to P(1, m), exactly the same sequence of valuations is computed as when value iteration is applied
to the same game. From these two facts, we can derive an upper bound on the patience of the strategies
computed by strategy iteration on P(1, m). Next, a quite involved argument shows that when applying
strategy iteration to P(N, m), the sequence of strategies computed for one of the positions (the initial one)
is exactly the same as the one computed when strategy iteration is applied to P(1, m). We also show that
the smallest behavior probability in the computed strategy for P(N, m) occurs in the initial position. In
particular, the patiences of the sequence of strategies computed for P(N, m) are the same as the patiences of
the sequence of strategies computed for P(1, m). Finally, our analysis of value iteration for P(N, m) and the
relationship between patience and value iteration allow us to conclude that a strategy with low patience for
P(N, m) cannot be near-optimal, yielding the desired doubly-exponential lower bound.



2 Theorems and Proofs 

2.1 The connection between patience, the value of time bounded games, and 
the complexity of value iteration 

The key to understanding value iteration is the following folklore lemma. Given a concurrent reachability
game G, we define $G_T$ to be the finite extensive form game with the same rules as G, except that Player I
loses if he has not reached GOAL after T moves of the pebble. The positions of $G_T$ are denoted by (i, t),
where i is a position of G and t is an integer denoting the number of time steps left until Dante's time is
out.

Lemma 1 The valuation $v_i^t$ computed by the value iteration algorithm when applied to a game G is the exact
value of position (i, t) in the game $G_T$.

The proof is an easy induction in t ("backward induction"). A very general result by Mertens and Neyman
[16] establishes that for a much more general class of games (undiscounted stochastic games), the value of
the time bounded version converges to the value of the infinite version as the time bound approaches infinity.
Combining this with Lemma 1 immediately yields the correctness of the value iteration algorithm.

The patience [5] of a stationary strategy for a concurrent reachability game is 1/p, where p is the smallest 
non-zero behavior probability employed by the strategy in any position. The following lemma relates the 
patience of near-optimal strategies of a reachability game to the difference between the values of the time 
bounded and the infinite game and hence to the convergence rate of value iteration. 
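The definition of patience can be stated in a few lines of Python. The sketch below assumes, for illustration only, that a stationary strategy is represented as a list of probability distributions, one per non-terminal position; the function name is likewise hypothetical.

```python
from fractions import Fraction


def patience(strategy):
    """Return 1/p, where p is the smallest non-zero behavior probability used by
    the stationary strategy in any position."""
    p = min(q for distribution in strategy for q in distribution if q > 0)
    return 1 / p


# A two-position example: the smallest non-zero probability is 1/3.
example = [[Fraction(2, 3), Fraction(1, 3)], [Fraction(1), Fraction(0)]]
print(patience(example))  # 3
```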

Lemma 2 Let G be a reachability game with N non-terminal positions and with an ε-optimal strategy of
patience at most l, for some l ≥ 1, ε > 0. Let $T = kNl^N$ for some k ≥ 1, and let u be any position of G. Then,
the value of position (u, T) of $G_T$ differs from the value of the position u of G by at most $\epsilon + e^{-k}$.

Proof We want to show that the value of (u, T) in $G_T$ is at least $v_u - \epsilon - e^{-k}$, where $v_u$ is the value of
position u in G. We can assume that $v_u > \epsilon$, because otherwise we are done. Fix an ε-optimal stationary
strategy x for Dante in G of patience at most l. Consider this as a strategy of $G_T$ and consider play starting
in u. We shall show that x guarantees Dante to win $G_T$ with probability at least $v_u - \epsilon - e^{-k}$, thus proving
the statement. Consider a best reply y by Lucifer to x in $G_T$. Note that y does not necessarily correspond
to a stationary strategy in G. The strategy can still be played by Lucifer in G, by playing by it for the first
T time steps and playing arbitrarily afterwards.

Call a position v of G alive if there are paths from v to GOAL in all directed graphs obtained from G
in the following way: The nodes of the graphs are the positions of G. We then select for each position an
arbitrary column of the corresponding matrix, and let the edges going out from this node correspond to
the pointers in the chosen column and in the rows to which Dante assigns positive probability. That is, intuitively,
a position v is alive if and only if there is no absolutely sure way for Lucifer to prevent Dante from
reaching GOAL when play starts in v. Positions that are not alive are called dead. Note that if a position v
is dead, the strategy y, being a best reply of Lucifer, will pick actions so that the probability of play reaching
GOAL, conditioned on play having reached v, is 0. On the other hand, if the current position v is alive,
the conditional probability that play reaches GOAL within the next N steps is at least $(1/l)^N$. That is,
looking at the entire play, the probability that play has not reached either GOAL or a dead state after T
steps is at most $(1 - l^{-N})^{T/N} = (1 - l^{-N})^{kl^N} \le e^{-k}$. Suppose now that GOAL is reached in T steps with
probability strictly less than $v_u - \epsilon - e^{-k}$ when play starts in u. This means that a dead position is reached
with probability strictly greater than $1 - (v_u - \epsilon - e^{-k}) - e^{-k}$, i.e., strictly greater than $1 - (v_u - \epsilon)$. But
this means that if Lucifer plays y as a reply to x in the infinite game G he will in fact succeed in getting
the pebble to reach a dead position and hence prevent Dante from ever reaching GOAL, with probability
strictly greater than $1 - (v_u - \epsilon)$. This contradicts x being ε-optimal for Dante in G. Thus, we conclude that
GOAL is in fact reached in T steps with probability at least $v_u - \epsilon - e^{-k}$ when play starts in u with x and
y being played against each other in $G_T$, as desired. □
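The exponential term in the proof comes from a standard estimate, spelled out here as a worked step. It assumes the reading $T = kNl^N$ of the time bound, so that play can be cut into $T/N = kl^N$ consecutive blocks of N steps, and it uses the elementary inequality $1 - x \le e^{-x}$ with $x = l^{-N}$.

```latex
\left(1 - l^{-N}\right)^{T/N}
  = \left(1 - l^{-N}\right)^{k l^{N}}
  \le \left(e^{-l^{-N}}\right)^{k l^{N}}
  = e^{-k}.
```

Each block started in an alive position reaches GOAL with probability at least $l^{-N}$, so the left-hand side bounds the probability that all $kl^N$ blocks fail while play stays among alive positions.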

The connection between the convergence of value iteration and the time bounded version of the game allows 
us to reformulate the lemma in the following very useful way. 



Lemma 3 Let G be a reachability game with an ε-optimal strategy of patience at most l, for some ε > 0.
Then, $T = kNl^N$ rounds of value iteration are sufficient to approximate the values of all positions of the game
with additive error at most $\epsilon + e^{-k}$.

We can use this lemma to prove our upper bound on the number of iterations of value iteration (and hence
also strategy iteration). The following lemma is from Hansen et al. [11].

Lemma 4 (Hansen, Koucky, Lauritzen, Miltersen and Tsigaridas) Let ε > 0 be arbitrary. Any
concurrent reachability game with N positions and at most m ≥ 2 actions in each position has an ε-optimal
stationary strategy of patience at most $(1/\epsilon)^{m^{O(N)}}$.

This lemma is an asymptotic improvement of Theorem 4 of Hansen et al. [12], that gave an upper bound of
$(1/\epsilon)^{2^{30M}}$, for a total number of M actions, when M ≥ 10 and 0 < ε < 1/2. This result does however have the
advantage of an explicit constant in the exponent, which the bound of Lemma 4 lacks.

Combining Lemma 3 and Lemma 4, and also applying inequality (1), we get the following upper bound:

Theorem 5 Let ε > 0 be arbitrary. When applying value iteration or strategy iteration to a concurrent
reachability game with N non-terminal positions and m ≥ 2 choices for each player in each position, after
at most $(1/\epsilon)^{m^{O(N)}}$ iterations, an ε-approximation to the value has been obtained.

Also, Lemma 3 will be very useful for us below when applied in the contrapositive. Specifically, below, we
will directly analyze and compare the value of P(N, m) with the value of its time bounded version, and use
this to conclude that the value iteration algorithm does not converge quickly when applied to this game.
The lemma then implies that the patience of any ε-optimal strategy is large. When we later consider the
strategy iteration algorithm applied to the same game, we will show that the strategy computed after any
sub-astronomical number of iterations has too low patience to be ε-optimal.

2.2 The value of time bounded Generalized Purgatory and the complexity of 
value iteration 

In this section we give an upper bound on the value of a time bounded version of the Generalized Purgatory
game P(N, m). As explained in Section 2.1, this upper bound immediately implies a lower bound on the
number of iterations needed by value iteration to approximate the value of the original game.

We let $P_T(N, m)$ be the time bounded version of P(N, m) as defined in Section 2.1, i.e., $P_T(N, m)$ is
syntactic sugar for $(P(N, m))_T$. Also, we need to fix an indexing of the positions of P(N, m). We define
position i for i = 1, ..., N to be the position where Dante already guessed correctly i - 1 times in a row and
still needs to guess correctly N - i + 1 times in a row to win the game.

First we give a rather precise analysis of the one-position case. Besides being interesting in its own right 
(to establish that value iteration is exponential even for this case), this will also be useful later when we 
analyze strategy iteration. 

Theorem 6 Let m ≥ 2 and T ≥ 1. The value of position (1, T) of $P_T(1, m)$ is less than

$$1 - \left(1 - \frac{1}{m}\right)\left(\frac{1}{mT}\right)^{1/(m-1)}.$$

Proof Let $\epsilon = (1/(mT))^{1/(m-1)}$. Consider any strategy (not necessarily stationary) for Dante for playing
$P_T(1, m)$. In each round of play, Dante chooses his action with a probability distribution that may depend
on previous play and time left. We define a reply by Lucifer in a round-to-round fashion.

Fix a history of play leading to some current round and let $p_1, p_2, \ldots, p_m$ be the probabilities by which
Dante plays 1, 2, ..., m in this current round. There are two cases.

1. There is an i so that $p_i < \frac{1-\epsilon}{\epsilon} \sum_{j \ge i+1} p_j$. We call such a round a green round. In this case, Lucifer
plays i.

2. For all i, $p_i \ge \frac{1-\epsilon}{\epsilon} \sum_{j \ge i+1} p_j$. We call such a round a red round. In this case, Lucifer plays m.

This completes the definition of Lucifer's reply.

We now analyze the probability that Dante wins $P_T(1, m)$ when he plays his strategy and Lucifer plays
this reply. We show this probability to be at most

$$1 - \left(1 - \frac{1}{m}\right)\left(\frac{1}{mT}\right)^{1/(m-1)}$$

and we shall be done.

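Let us consider a green round. We claim that the probability that Dante wins in this round, conditioned
on the previous history of play, and conditioned on play ending in this round, is at most $1 - \epsilon$. Indeed, this
conditional probability is given by

$$\frac{p_i}{p_i + (p_{i+1} + \cdots + p_m)} < \frac{\frac{1-\epsilon}{\epsilon} \sum_{j \ge i+1} p_j}{\frac{1-\epsilon}{\epsilon} \sum_{j \ge i+1} p_j + \sum_{j \ge i+1} p_j} = \frac{(1-\epsilon)/\epsilon}{(1-\epsilon)/\epsilon + \epsilon/\epsilon} = 1 - \epsilon.$$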

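Let us next consider a red round. We claim that the probability of play ending in this round, conditioned
on the previous history of play, is at most $\epsilon^{m-1}$. Indeed, note that this conditional probability is exactly $p_m$,
and that

$$1 = \sum_{i=1}^{m} p_i = p_1 + \sum_{i=2}^{m} p_i \ge \left(1 + \frac{1-\epsilon}{\epsilon}\right)\left(\sum_{j=2}^{m} p_j\right) = \left(1 + \frac{1-\epsilon}{\epsilon}\right)\left(p_2 + \sum_{i=3}^{m} p_i\right)$$

$$\ge \left(1 + \frac{1-\epsilon}{\epsilon}\right)^{2}\left(\sum_{j=3}^{m} p_j\right) \ge \cdots \ge \left(1 + \frac{1-\epsilon}{\epsilon}\right)^{m-1} p_m = \left(\frac{1}{\epsilon}\right)^{m-1} p_m$$

from which $p_m \le \epsilon^{m-1}$. That is, in every round of play, conditioned on previous play, either it is the
case that the probability that play ends in this round is at most $\epsilon^{m-1}$ (for the case of a red round) or it is
the case that conditioned on play ending, the probability of win for Dante is at most $1 - \epsilon$ (for the case of a
green round).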

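Now let us estimate the probability of a win for Dante in the entire game $P_T(1, m)$. Let W denote the
event that Dante wins. Let G be the event that play ends in a green round. Also, let R be the event that
play ends in a red round. Then, we have

$$\begin{aligned}
\Pr[W] &= \Pr[W \mid R]\Pr[R] + \Pr[W \mid G]\Pr[G] \\
&\le \Pr[R] + \Pr[W \mid G]\Pr[G] \\
&= \Pr[R] + \Pr[W \mid G](1 - \Pr[R]) \\
&= \Pr[R] + \Pr[W \mid G] - \Pr[R]\Pr[W \mid G] \\
&\le (\epsilon^{m-1})T + (1 - \epsilon) - (\epsilon^{m-1})T(1 - \epsilon) \\
&= 1 - \epsilon + T\epsilon^{m} \\
&= 1 - \left(\frac{1}{mT}\right)^{1/(m-1)} + T\left(\frac{1}{mT}\right)^{m/(m-1)} \\
&= 1 - \left(1 - \frac{1}{m}\right)\left(\frac{1}{mT}\right)^{1/(m-1)}.
\end{aligned}$$

□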
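The last two steps of the computation rest on the identity $T\epsilon^{m} = \epsilon/m$ for $\epsilon = (1/(mT))^{1/(m-1)}$. The short Python sketch below checks numerically that the intermediate expression $1 - \epsilon + T\epsilon^{m}$ and the final bound agree; the function names are hypothetical and floating point arithmetic is used, so the comparison is only up to rounding.

```python
import math


def theorem6_bound(m, T):
    """Final bound: 1 - (1 - 1/m) * (1/(m*T)) ** (1/(m - 1))."""
    eps = (1.0 / (m * T)) ** (1.0 / (m - 1))
    return 1.0 - (1.0 - 1.0 / m) * eps


def intermediate_bound(m, T):
    """Intermediate expression 1 - eps + T * eps ** m with the same eps."""
    eps = (1.0 / (m * T)) ** (1.0 / (m - 1))
    return 1.0 - eps + T * eps ** m


for m, T in [(2, 1), (2, 1000), (5, 10 ** 6)]:
    # Both expressions should coincide up to floating point error.
    print(m, T, theorem6_bound(m, T), intermediate_bound(m, T))
```

For m = 2 the bound simplifies to 1 - 1/(4T), so even a single non-terminal position needs many iterations before the valuation gets close to 1.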

Combining Lemma 1 with Theorem 6, we get the result that value iteration needs exponential time, even
for one-position games.

Corollary 7 Let 0 < ε < 1. Applying fewer than $\frac{1}{em}(1/\epsilon)^{m-1}$ iterations of the value iteration algorithm to
P(1, m) yields a valuation at least ε smaller than the exact value.

Next, we analyze the N-position case, where we give a somewhat coarser bound.

Theorem 8 Let N, m, k, T be integers with N ≥ 2, m ≥ 2, 1 ≤ k ≤ N - 2 and $T \le 2^{m^{N-k}}$. Then, the value
of $P_T(N, m)$ is at most $2m^{-k} + 2^{-m^{N-k-1}}$.

Proof We show an upper bound on the value of $P_T(N, m)$ of $2m^{-k} + 2^{-m^{N-k-1}}$ by exhibiting a particular
strategy of Lucifer and showing that any response by Dante to this particular strategy of Lucifer will make
Dante win with probability at most $2m^{-k} + 2^{-m^{N-k-1}}$.

To structure the proof, we divide the play into epochs. An epoch begins and another ends immediately
after each time Dante has guessed incorrectly by undershooting, so that he now finds himself in exactly the
same situation as when the play begins (but in general with less time left to win). That is, Dante wins if and
only if there is an epoch of length N containing only correct guesses. For convenience, we make the game a
little more attractive for Dante by continuing play for T epochs, rather than T rounds. Call this prolonged
game $G'_T$. Clearly, the value of $P_T(N, m)$ is at most the value of $G'_T$, so it is okay to prove the upper bound for
the latter. We index the epochs 1, 2, ..., T.

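To define the strategy of Lucifer, we first define a function $f : \mathbb{N} \times \mathbb{N} \to \mathbb{N}$ as follows:

$$f(i, j) = 1 + (j - 1) \sum_{r=0}^{i-1} m^{r}$$

Then, it is easy to see that f satisfies the following two equations.

$$f(i, m) = m^{i} \qquad (2)$$

$$f(i, j + 1) = f(i, j) + \sum_{r=0}^{i-1} f(r, m) \qquad (3)$$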

The specific strategy of Lucifer is this: Let d be the number of rounds already played in the current
epoch. If d ≥ N - k, Lucifer chooses a number between 1 and m uniformly at random. If d < N - k, he
hides the numbers j = 1, ..., m - 1 with probabilities $p_j(d) = 2^{-f(N-k-d,\, m-j+1)}$ and puts all remaining
probability mass on the number m (since N - k - d ≥ 1 and m ≥ 2, there is indeed some probability mass
left for m).
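The following Python sketch tabulates Lucifer's distribution and checks it against equations (2) and (3). It rests on two assumptions that should be read as a reconstruction rather than as the authors' exact formulas: that f(i, j) = 1 + (j - 1) * (m^0 + ... + m^(i-1)), which is the function determined by equations (2) and (3), and that the hiding probabilities are p_j(d) = 2^(-f(N-k-d, m-j+1)). The function names and argument order are hypothetical.

```python
from fractions import Fraction


def f(i, j, m):
    """Assumed form of f: it satisfies f(i, m) = m**i and
    f(i, j + 1) = f(i, j) + sum of f(r, m) for r = 0, ..., i - 1."""
    return 1 + (j - 1) * sum(m ** r for r in range(i))


def lucifer_distribution(d, N, k, m):
    """Probabilities with which Lucifer hides 1, ..., m after d rounds of the epoch."""
    if d >= N - k:
        return [Fraction(1, m)] * m                  # uniform in the last k rounds
    probs = [Fraction(1, 2 ** f(N - k - d, m - j + 1, m)) for j in range(1, m)]
    probs.append(1 - sum(probs))                     # remaining mass goes to m
    return probs


# Example with N = 4, k = 1, m = 3: the number 1 is hidden with probability
# 2**-27 in round 0 and 2**-9 in round 1 of an epoch.
print(lucifer_distribution(0, 4, 1, 3)[0], lucifer_distribution(1, 4, 1, 3)[0])
```

Under these assumptions, guessing 1 in the first two rounds of an epoch is correct with probability 2^(-m^(N-k) - m^(N-k-1)), which is the doubly-exponentially small quantity appearing in the bound of Theorem 8.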

Freeze the strategy of Lucifer to this strategy. From the point of view of Dante, the game Gt is now 
a finite horizon absorbing Markov decision process. Thus, he has an optimal policy that is deterministic 
and history independent. That is, the choices of Dante according to this policy depend only on the number 
of rounds already played in the present epoch and the remaining number of epochs before the limit of T 
epochs has been played, or, equivalently, on the index of the current epoch. We can assume without loss of 
generality that Dante plays such an optimal policy. That is, his optimal policy for epoch t can be described 
by a specific sequence of actions $a_{t0}, a_{t1}, a_{t2}, \ldots, a_{t(N-1)}$ in $\{1, \ldots, m\}$ to make in the next $N$ rounds (with
the caveat that this sequence of choices will be aborted if the epoch ends). 

We define the following mutually exclusive events $W_t$, $L_t$.

• $W_t$: Dante wins the game in epoch $t$ (by guessing correctly $N$ times).

• $L_t$: Dante loses the game in epoch $t$ (by overshooting Lucifer's number).

We make the following claim:

Claim: For each $t$, either $\Pr[W_t] \le 2^{-m^{N-k}-m^{N-k-1}}$ or $\Pr[W_t]/\Pr[L_t] \le 2m^{-k}$.

First, let us see that the claim implies the lemma. Indeed, the probability of Dante winning can be split
into the contributions from those epochs where Dante wins with probability at most $2^{-m^{N-k}-m^{N-k-1}}$ and
the remaining epochs. The total winning probability mass from the first is at most $T\,2^{-m^{N-k}-m^{N-k-1}} \le
2^{-m^{N-k-1}}$ and the total winning probability mass of the rest is at most $2m^{-k}$ (the events $L_t$ are mutually
exclusive, so their probabilities sum to at most 1), giving an upper bound for
Dante's winning probability of $2m^{-k} + 2^{-m^{N-k-1}}$.

So let us prove the claim. Fix an epoch $t$ and let $a_{t0}, a_{t1}, a_{t2}, \ldots, a_{t(N-1)}$ be Dante's sequence of actions.
Suppose $a_{t0} = 1$ and $a_{t1} = 1$. Then, since Lucifer only plays 1 in the first two rounds with probability
$p_1(0)p_1(1) = 2^{-f(N-k,m)} \cdot 2^{-f(N-k-1,m)}$, Dante only wins the game in this epoch with at most that
probability, which by equation (2) is equal to $2^{-m^{N-k}-m^{N-k-1}}$, as desired.

Now assume $a_{t0} > 1$ or $a_{t1} > 1$. We want to show that $\Pr[W_t]/\Pr[L_t] \le 2m^{-k}$. Let $d$ be the largest
index so that $d < N-k$ and so that $a_{td} > 1$. Since $a_{t0} > 1$ or $a_{t1} > 1$, such a $d$ exists. Let $E$ be
the event that epoch $t$ lasts for at least $d$ rounds. We will show that $\Pr[W_t|E]/\Pr[L_t|E] \le 2m^{-k}$. Since
$W_t \subseteq E$, this also implies that $\Pr[W_t]/\Pr[L_t] \le 2m^{-k}$. Since we condition on $E$ we look at Dante's
decision after $d$ rounds of epoch $t$. He chooses the action $j = a_{td} > 1$. If Lucifer at this point chooses a
number smaller than $j$, Dante loses. In particular, since Lucifer chooses the number $j-1$ with probability
$2^{-f(N-k-d,\,m+1-(j-1))}$, Dante loses the entire game by his action $a_{td}$ with probability at least $2^{-f(N-k-d,\,m+2-j)}$,
conditioned on $E$. On the other hand the probability that he wins the game in this epoch conditioned on
$E$ is at most $(2^{-f(N-k-d,\,m+1-j)})(\prod_{i=d+1}^{N-k-1} 2^{-f(N-k-i,\,m)})(m^{-k})$, the first factor being the probability that
Lucifer chooses $j$ at round $d$, the second factor being the probability that Lucifer like Dante repeatedly
chooses 1 until the last $k$ rounds of the epoch begin, and the third factor being the probability that Lucifer
matches Dante's choices in those $k$ rounds. Now we have

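$$\Pr[W_t]/\Pr[L_t] \le \Pr[W_t|E]/\Pr[L_t|E] \le \Big(2^{-f(N-k-d,\,m+1-j)}\Big)\Big(\prod_{i=d+1}^{N-k-1} 2^{-f(N-k-i,\,m)}\Big)\, m^{-k}\, 2^{f(N-k-d,\,m+2-j)}$$

$$= m^{-k}\, 2^{f(N-k-d,\,m+2-j) - f(N-k-d,\,m+1-j) - \sum_{r=1}^{N-k-d-1} f(r,m)} = 2m^{-k}$$

as desired, where the last equality uses equation (3) together with $f(0,m) = 1$ from equation (2). □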
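The cancellation in the exponent of the last display can be checked mechanically. The sketch below again assumes the closed form $f(i,j) = 1 + (j-1)\sum_{r=0}^{i-1} m^r$ (the function fixed by equations (2) and (3)); `ratio_exponent` is an illustrative helper name, and it returns the exponent of 2 that multiplies $m^{-k}$ in the bound on $\Pr[W_t|E]/\Pr[L_t|E]$.

```python
def f(i, j, m):
    """Assumed closed form: f(i, j) = 1 + (j - 1) * (m^0 + ... + m^(i-1))."""
    return 1 + (j - 1) * sum(m ** r for r in range(i))


def ratio_exponent(N, k, d, j, m):
    """Exponent e such that the bound on Pr[W|E] / Pr[L|E] equals 2^e * m^(-k)."""
    i = N - k - d
    lose = f(i, m + 2 - j, m)        # Lucifer hides j-1, so Dante overshoots
    win_first = f(i, m + 1 - j, m)   # Lucifer hides j at round d
    # Lucifer hides 1 in rounds d+1 .. N-k-1.
    win_rest = sum(f(N - k - r, m, m) for r in range(d + 1, N - k))
    return lose - win_first - win_rest
```

For every admissible choice of $d < N-k$ and $j > 1$ the exponent comes out as 1, which is the factor 2 in $2m^{-k}$.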

Combining Lemma [T] with Theorem [8] we get the result that value iteration needs doubly exponential time 
to obtain any non-trivial approximation: 

Corollary 9 Let $N$ be even. Applying less than $2^{m^{N/2}}$ iterations of the value iteration algorithm to $P(N,m)$
yields a valuation of the initial position of at most $3m^{-N/2}$, even though the actual value of the game is 1.

We also get the following bound on the patience of near-optimal strategies of P{N, m) that will be useful 
when analyzing strategy iteration. 

Theorem 10 Suppose $N$ is sufficiently large and $m \ge 2$. Let $\epsilon = 1 - 4m^{-N/2}$. Then all $\epsilon$-optimal strategies
of $P(N,m)$ have patience at least $2^{m^{N/3}}$.

Proof Putting $c = \frac{N \ln m}{2}$, Lemma 5 tells us that if $P(N,m)$ has an $\epsilon$-optimal strategy of patience less than
$l = 2^{m^{N/3}}$, then the value of $P_t(N,m)$ is at least $1 - \epsilon - e^{-c} = 3m^{-N/2}$ where $t = cNl^N \le 2^{m^{N/2}}$. But
putting $k = N/2$, Theorem 8 tells us that the value of $P_t(N,m)$ is at most $2m^{-N/2} + 2^{-m^{N/2-1}} < 3m^{-N/2}$,
a contradiction. □
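The final strict inequality in this proof is where "N sufficiently large" is used. The following sketch evaluates the bound $2m^{-k} + 2^{-m^{N-k-1}}$ at $k = N/2$ exactly, as that bound is read from the proof of Theorem 8 above; the function names are illustrative only.

```python
from fractions import Fraction


def theorem8_bound(N, k, m):
    """Upper bound 2*m^(-k) + 2^(-m^(N-k-1)) on the value of the time-bounded game."""
    return 2 * Fraction(1, m ** k) + Fraction(1, 2 ** (m ** (N - k - 1)))


def strictly_below_3m(N, m):
    """Check 2*m^(-N/2) + 2^(-m^(N/2-1)) < 3*m^(-N/2) for even N."""
    k = N // 2
    return theorem8_bound(N, k, m) < 3 * Fraction(1, m ** k)
```

For $m = 2$ the inequality fails at $N = 4$, where both sides equal $3/4$, and holds from $N = 6$ onwards, which illustrates why the statement is restricted to sufficiently large $N$.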

2.3 Strategy Iteration 

The technical content of this section is a number of lemmas on what happens when the strategy iteration 
algorithm is applied to $P(N,m)$, leading up to the following crucial lemma:

Lemma 11 When applying strategy iteration to $P(N,m)$, the patience of the strategy $x^t$ computed in iteration
$t$ is at most $e \cdot m \cdot t$.

Before we prove Lemma 11, we show that it implies the lower bound we are looking for.

Theorem 12 Suppose $N$ is sufficiently large. Applying less than $2^{m^{N/4}}$ iterations of strategy iteration to
$P(N,m)$ yields a valuation of the initial position of less than $4m^{-N/2}$, despite the fact that the value of the
position is 1.

Proof Lemma 11 implies that the patience of the strategy $x^t$ computed in iteration $t$ for $t = 2^{m^{N/4}}$ is at
most $e m 2^{m^{N/4}}$. Theorem 10 states that if $\epsilon = 1 - 4m^{-N/2}$, then all $\epsilon$-optimal strategies of $P(N,m)$ have
patience at least $2^{m^{N/3}}$. So $x^t$ is not $\epsilon$-optimal and the bound follows. □

To prove Lemma 11, we need to understand strategy iteration on $P(N,m)$ and shall through a number
of lemmas establish:

• For the one-position case $P(1,m)$, value iteration and strategy iteration are "in synch", i.e., $\tilde{v}_i^t = v_i^t$
for all $i$ and $t$.

• When applying strategy iteration to $P(N,m)$, the strategy computed for position 1 after $t$ iterations
is the same as that computed by strategy iteration applied to $P(1,m)$ after $t$ iterations.

• When applying strategy iteration to $P(N,m)$, the smallest behavior probability computed occurs at
position 1 and the patience of the strategy computed can therefore be determined by looking at that
position.

In all lemmas below, unless otherwise mentioned, we consider applying the strategy iteration algorithm
to $P(N,m)$ and the quantities $v^t$, $x^t$, etc., are those computed by this algorithm.

Lemma 13 $\forall t, i \in \{1, 2, \ldots, N+1\} : v_i^t > 0$

Proof For $t = 1$, we have that $x^1$ is the uniform distribution at each position. We then see that $v_N^1 = \frac{1}{m} > 0$,
since no matter which number Lucifer chooses, Dante selects the right one with probability $\frac{1}{m}$. We also see
that Dante has a probability of winning $i$ times in a row of $\frac{1}{m^i} > 0$. We therefore have that $v_{N-i+1}^1 \ge \frac{1}{m^i} > 0$.
We know that $v^{t+1} \ge v^t$ (see, e.g., Chatterjee et al. [2]), so $\forall t, i : 0 < v_i^1 \le v_i^t$. □

Lemma 14 $\forall t, i \in \{1, \ldots, N\}, j \in \{1, \ldots, m\} : 0 < x_{i,j}^t < 1$

Proof Since $\forall i, t : \sum_{j=1}^{m} x_{i,j}^t = 1$, we only need to show that $x_{i,j}^t > 0$. We will do the proof by contradiction.
Assume that $\exists t, i, j : x_{i,j}^t = 0$. If Lucifer replies to $x^t$ by choosing $j$ in position $i$, play reaches GOAL with
probability 0. Therefore $v_i^t = 0$, which we showed was not the case in Lemma 13. □

Lemma 15 $\forall t, i : v_i^t < 1$

Proof Since $\forall t, i, j : x_{i,j}^t > 0$ by Lemma 14, we have that all strategies $y$ for Lucifer in position $i$, except
for Lucifer always choosing $m$, will make Dante lose with positive probability. In particular, the best reply
by Lucifer to $x^t$ must have that property. □

Lemma 16 $\forall t, i : v_i^t > v_{i-1}^t$

Proof Recall that $v_i^t$ is the winning probability of Dante if play starts in position $i$ when he plays using
$x^t$ and Lucifer plays a best reply. By construction of $P(N,m)$ we have that any winning play starting in
position $i-1$ must subsequently visit position $i$. Therefore, $v_i^t \ge v_{i-1}^t$. By Lemma 14 we have that Lucifer
can play 1 in position $i-1$ and hence prevent, with positive probability, Dante from proceeding to position $i$
from position $i-1$. Dante therefore might lose the game in position $i-1$ with positive probability. Therefore,
$v_i^t > v_{i-1}^t$. □

To proceed, we need to consider the matrix games that arise when strategy iteration is executed on
$P(N,m)$. Fortunately, these are all of a special form that can be easily analyzed.

For a real number $z$ with $0 \le z < 1$, let $B(z)$ be the $m \times m$ matrix of real numbers with 1 in the diagonal,
0 in all entries below the diagonal, and $z$ in all entries above the diagonal. Also, considering $B(z)$ as a matrix
game with the row player being the maximizer, let $p^z$ be an optimal strategy for the row player and $q^z$ be an
optimal strategy for the column player. Finally, we let $v^z$ be the value of the matrix game. Straightforward
calculations, which we will omit, yield the following facts about the matrix game $B(z)$.

Lemma 17 For all values $0 \le z < 1$, the matrix game $B(z)$ has the following properties.

• The row player has a uniquely determined optimal strategy $p^z$. This strategy is fully mixed.

• For all $i > 1$, we have that $p_i^z = p_1^z (1-z)^{i-1}$.

Lemma 18 If $0 < y < z < 1$, the optimal strategies $p^y$, $p^z$ satisfy: $p_1^y < p_1^z$ and $p_m^y > p_m^z$.
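The calculations omitted above can be spot-checked numerically. The sketch below builds $B(z)$ and the strategy $p_i = p_1(1-z)^{i-1}$ with $p_1 = z/(1-(1-z)^m)$, a normalisation derived here from the geometric form in Lemma 17 rather than quoted from the paper. It only verifies that this strategy is a distribution that equalises the payoff over all columns; optimality itself is taken from Lemma 17. The helper names are illustrative.

```python
from fractions import Fraction


def B(z, m):
    """m x m matrix: 1 on the diagonal, z above it, 0 below it."""
    z = Fraction(z)
    return [[Fraction(1) if r == c else (z if r < c else Fraction(0))
             for c in range(m)] for r in range(m)]


def row_strategy(z, m):
    """Geometric strategy p_i = p_1 (1 - z)^(i-1), normalised to sum to 1."""
    z = Fraction(z)
    if z == 0:
        return [Fraction(1, m)] * m   # B(0) is the identity: play uniformly
    p1 = z / (1 - (1 - z) ** m)
    return [p1 * (1 - z) ** (i - 1) for i in range(1, m + 1)]


def column_payoffs(p, M):
    """Expected payoff to the row player against each pure column."""
    m = len(p)
    return [sum(p[r] * M[r][c] for r in range(m)) for c in range(m)]
```

For example, with $m = 3$ and $z = 1/2$ the strategy is $(4/7, 2/7, 1/7)$ and every column yields $4/7$; lowering $z$ to $1/4$ decreases the first probability and increases the last, in line with Lemma 18.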

The connection between strategy iteration and the matrix game $B(z)$ is given by:

Lemma 19 For all $t, i$, let
$$z = \begin{cases} 0 & \text{for } t = 1 \\ v_1^{t-1}/v_{i+1}^{t-1} & \text{for } t > 1. \end{cases}$$
Under the assumption that $\forall i, t' < t : \mathrm{val}(A_i(v^{t'})) > v_i^{t'}$, the strategy $x_i^t$ computed by strategy iteration on $P(N,m)$ is $p^z$.

Proof For $t = 1$, we see that the optimal strategy for both players in the matrix game $B(z)$, which is in this
case the identity matrix, is to play uniformly in $B(0)$, which is the same strategy as $x^1$
and $y^1$.

For $t > 1$, we see that, if we update $x_i^t$, which we do by assumption, $x_i^t$ is the optimal solution for the row
player in the matrix game given by the $m \times m$ matrix with $v_{i+1}^{t-1}$ in the diagonal, 0 in all entries below the diagonal
and $v_1^{t-1}$ in all entries above the diagonal. We can divide each entry in this matrix by $v_{i+1}^{t-1}$, which is positive by Lemma 13.
This yields the matrix $B\big(v_1^{t-1}/v_{i+1}^{t-1}\big)$. The new matrix will have the same optimal strategies for the row player.
By Lemma 13 and 16 we have that $0 < v_1^{t-1}/v_{i+1}^{t-1} < 1$. Therefore, $x_i^t$ is exactly $p^z$. □

Lemma 20 When applying strategy iteration to $P(N,m)$, if Lucifer's best reply $y^t$ is equal to the strategy
that chooses 1 in all positions, then
$$\frac{v_1^t}{v_{i+1}^t} = \prod_{j=1}^{i} x_{j,1}^t.$$

Proof $v_k^t = \prod_{j=k}^{N} x_{j,1}^t$, from which the statement follows. □

Lemma 21 If Lucifer's best reply $y^s$ is equal to the strategy that plays 1 in all positions for all $s < t$, then
$$\forall i, t' < t : \mathrm{val}(A_i(v^{t'})) > v_i^{t'}.$$

Proof We will show the statement using induction in $t'$.

We see that $\mathrm{val}(A_i(v^{t'})) = v_{i+1}^{t'} \cdot \mathrm{val}\big(A_i(v^{t'})/v_{i+1}^{t'}\big) = v_{i+1}^{t'} \cdot \mathrm{val}\big(B(v_1^{t'}/v_{i+1}^{t'})\big)$.

We can also see that $v_i^{t'} = v_{i+1}^{t'} \cdot x_{i,1}^{t'}$, since we know that Lucifer played 1 at time $t'$ (so Dante loses if he
plays $p > 1$ and must win from position $i+1$ otherwise).

So we just need to show that $x_{i,1}^{t'} < \mathrm{val}\big(B(v_1^{t'}/v_{i+1}^{t'})\big)$.

For $t' = 1$:

We can see that $x_{i,1}^1 = \mathrm{val}(B(0))$.

By Lemma 17 we have that $x_{i,1}^1 = \mathrm{val}(B(0)) < \mathrm{val}\big(B(v_1^1/v_{i+1}^1)\big)$ and the result follows.

For $t' > 1$:

Since Lucifer played 1 at time $t'-1$, we can use Lemma 17 and Lemma 19 and get that, for all $j$,
$x_{j,1}^{t'} = \mathrm{val}\big(B(v_1^{t'-1}/v_{j+1}^{t'-1})\big)$, especially for $j = i$. By Lemma 18 we just need to show that
$v_1^{t'}/v_{i+1}^{t'} > v_1^{t'-1}/v_{i+1}^{t'-1}$.

We can use Lemma 20 and we get that $v_1^{t'}/v_{i+1}^{t'} = \prod_{j=1}^{i} x_{j,1}^{t'}$, and that $v_1^{t'-1}/v_{i+1}^{t'-1} = \prod_{j=1}^{i} x_{j,1}^{t'-1}$. We will show that
$x_{j,1}^{t'} > x_{j,1}^{t'-1}$ and the result follows, since $x_{j,1}^{t'-1} > 0$, by Lemma 14. But since $x_{j,1}^{t'} = \mathrm{val}\big(B(v_1^{t'-1}/v_{j+1}^{t'-1})\big)$, this is the
induction hypothesis. □

Lemma 22 For all $t, i$, let
$$z = \begin{cases} 0 & \text{for } t = 1 \\ v_1^{t-1}/v_{i+1}^{t-1} & \text{for } t > 1. \end{cases}$$
Then, the strategy $x_i^t$ computed by strategy iteration
on $P(N,m)$ is $p^z$, under the assumption that Lucifer chooses 1 for $\forall t' < t$ and all positions.

Proof The result follows from Lemma 19 and Lemma 21. □

Lemma 23 $\forall t, i, j : x_{i,j}^t = x_{i,1}^t \big(1 - v_1^{t-1}/v_{i+1}^{t-1}\big)^{j-1}$, under the assumption that Lucifer chooses 1 for $\forall t' < t$ and
all positions.

Proof This follows from Lemma 17 and Lemma 22. □

Lemma 24 $\forall t, i > 1 : x_{i-1,m}^t < x_{i,m}^t$, under the assumption that Lucifer chooses 1 for $\forall t' < t$ and all
positions.

Proof The result follows from Lemma 22 and Lemma 18, since we have that
$$\frac{v_1^{t-1}}{v_{i+1}^{t-1}} < \frac{v_1^{t-1}}{v_i^{t-1}}$$
from Lemma 13 and 16. □

Lemma 25 When applying strategy iteration to $P(N,m)$, if Lucifer's best replies $y^1, \ldots, y^t$ are all equal to
the strategy that chooses 1 in all positions, then $\forall t' \le t, i : x_{i,1}^{t'} < x_{i,1}^{t'+1}$.

Proof The proof will be by induction in $t'$.

For $t' = 1$: From Lemma 22 we have that $x_i^2$ is the optimal strategy of the row player of the matrix
game $B(v_1^1/v_{i+1}^1)$. Since $0 < v_1^1/v_{i+1}^1 < 1$, by Lemma 16, the result follows from Lemma 18.

For $t' > 1$, we have the induction hypothesis: $\forall i : x_{i,1}^{t'-1} < x_{i,1}^{t'}$. By Lemma 22 we have that $x_i^{t'}$ is an
optimal strategy for the row player in $B\big(v_1^{t'-1}/v_{i+1}^{t'-1}\big)$ and $x_i^{t'+1}$ is an optimal strategy for the row player in
$B\big(v_1^{t'}/v_{i+1}^{t'}\big)$. By Lemma 20 we have $v_1^{t'-1}/v_{i+1}^{t'-1} = \prod_{j=1}^{i} x_{j,1}^{t'-1}$ and $v_1^{t'}/v_{i+1}^{t'} = \prod_{j=1}^{i} x_{j,1}^{t'}$. From the induction hypothesis
and Lemma 14 we have that $\prod_{j=1}^{i} x_{j,1}^{t'-1} < \prod_{j=1}^{i} x_{j,1}^{t'}$.

So, $x_i^{t'}$ is the optimal strategy for the row player in $B\big(\prod_{j=1}^{i} x_{j,1}^{t'-1}\big)$ and $x_i^{t'+1}$ is the optimal strategy for
the row player in $B\big(\prod_{j=1}^{i} x_{j,1}^{t'}\big)$ and the lemma follows from Lemma 18. □

Lemma 26 Consider a stationary strategy $x$ for Dante in $P(N,m)$ that is fully mixed, i.e., assigns positive
probability to all actions. We may consider $x$ to be a strategy also for $P(k,m)$ for some $k < N$ by identifying
each position $i \in \{1, \ldots, k\}$ in $P(k,m)$ with position $i$ in $P(N,m)$. Suppose a pure strategy $y$ of Lucifer is a
best reply to $x$ in $P(N,m)$. Then, its restriction to positions $1, \ldots, k$ is also a best reply to $x$ in $P(k,m)$.

Proof We divide the non-terminal positions of $P(N,m)$ into two sets of positions, $S = \{1, 2, \ldots, k\}$ and $T =
\{k+1, \ldots, N\}$. We note that the only position the pebble can move to in $T$ directly from $S$ is $k+1$. Similarly
we note that the only position the pebble can move to in $S$ directly from $T$ is 1.

For a specific fully mixed $x$ and a reply $y$ for $P(N,m)$, an absorbing Markov process on the set of positions
is induced. Let $Q_{S,T}$ be the probability that the pebble eventually arrives at position $k+1$, if the process
is started in position 1. Let $Q_{T,S}$ be the probability that the pebble eventually arrives at 1, if the process is
started in $k+1$. Let $Q_{S,\mathrm{TRAP}}$ be the probability that the pebble goes to TRAP, without first visiting $T$, if
the process is started in position 1. Similarly, define $Q_{T,\mathrm{GOAL}}$ to be the probability that the pebble arrives at
GOAL without first visiting 1 if the process is started in position $k+1$, and $Q_{T,\mathrm{TRAP}}$ to be the probability
that the pebble arrives at TRAP without first visiting 1 if the process is started in position $k+1$. Observe
that $Q_{S,*}$ and $Q_{T,*}$ are probability distributions, since the probability for a play of infinite length within $S$
and $T$ is 0, because $x$ is assumed to be fully mixed.

For $u \in \{1, \ldots, k\}$ let $Q_{u,T}$ be the probability that the pebble reaches $T$ when started in $u$ when $x$ and $y$
are played. Note that best replies $y$ to $x$ in the restricted game $P(k,m)$ are characterized by being those $y$
minimizing all probabilities $Q_{u,T}$ simultaneously for all $u \in \{1, \ldots, k\}$, among all possible $y$, since reaching
GOAL in $P(k,m)$ amounts to reaching $T$ in $P(N,m)$. But note that in the original $P(N,m)$ game, the
probability of Dante reaching GOAL, when play starts in some $u \in \{1, \ldots, k\}$ is given by

$$Q_{u,T}\, Q_{T,\mathrm{GOAL}} \sum_{j=0}^{\infty} (Q_{T,S} Q_{S,T})^j = \frac{Q_{u,T}\, Q_{T,\mathrm{GOAL}}}{1 - Q_{T,S} Q_{S,T}}. \qquad (4)$$

Since $Q_{S,T} = Q_{1,T}$, we have that if the behavior of $y$ in positions $k+1, \ldots, N$ is fixed (and hence also $Q_{T,*}$
is fixed), the behavior of $y$ in positions $1, \ldots, k$ that simultaneously minimizes (4) for all $u$ is exactly the
same behavior that simultaneously minimizes $Q_{u,T}$. This concludes the proof. □
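Expression (4) is a geometric series over the number of round trips between $T$ and $S$. As a small illustration, the sketch below compares the closed form with a truncated sum using exact fractions; the argument names (`q_uT`, `q_T_goal`, `q_TS`, `q_ST`) are hypothetical stand-ins for $Q_{u,T}$, $Q_{T,\mathrm{GOAL}}$, $Q_{T,S}$ and $Q_{S,T}$.

```python
from fractions import Fraction


def goal_probability(q_uT, q_T_goal, q_TS, q_ST):
    """Closed form of (4): Q_uT * Q_T,GOAL / (1 - Q_TS * Q_ST)."""
    return q_uT * q_T_goal / (1 - q_TS * q_ST)


def goal_probability_truncated(q_uT, q_T_goal, q_TS, q_ST, terms):
    """Partial sum of the series in (4), counting at most terms-1 returns to S."""
    return q_uT * q_T_goal * sum((q_TS * q_ST) ** j for j in range(terms))
```

With $Q_{T,*}$ held fixed, the closed form is increasing in both $Q_{u,T}$ and $Q_{S,T} = Q_{1,T}$, which is the monotonicity the last step of the proof relies on.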

Lemma 27 When applying strategy iteration to $P(N,m)$, we have that for all $t \ge 1$, the best reply $y^t$
computed is the one where Lucifer chooses 1 in all positions.

Proof For $t = 1$, we see that for all strategies Lucifer can select, Dante guesses correctly with probability $\frac{1}{m}$, as
$x^1$ is the uniform choice in each position. If Lucifer plays 1, Dante will lose the entire game immediately with
probability $\frac{m-1}{m}$ at each position and advance one step with probability $\frac{1}{m}$. Any other choice of Lucifer will
preserve the advancement probability but decrease the probability that Dante loses the game immediately
(replacing the probability mass with a probability of going to the initial position). We conclude that choosing
1 is Lucifer's best reply.

So we only need to look at $t > 1$. We will do the proof using contradiction.

Let $t$ be the lowest value, such that there exists $N, m$ so that when applying strategy iteration to $P(N,m)$,
the reply $y^t$ does not choose 1 in every position. Also, let $N$ be the lowest such $N$ and let $i$ be the smallest
$i$ so that $y^t$ does not pick 1 in position $i$.

That is, for any position $k < i$, $y^t$ chooses action 1 in position $k$, so to determine the best reply $y^t$, we just
need to determine its action in position $i$. By Lemma 26, if we restrict $x^t$ to positions $1, \ldots, i$ and consider
the game $P(i,m)$, the reply $y^t$, restricted to $P(i,m)$, is also a best reply to $x^t$ in this game. We shall in fact
prove that in this game, Lucifer's reply is not best, unless it chooses 1, also in position $i$. This will yield the
desired contradiction. We shall look at each of Lucifer's possible actions in position $i$.

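If Lucifer chooses 1, and play starts in position $i$, Dante wins $P(i,m)$ if he chooses 1. Dante has
probability $x_{i,1}^t$ of doing this.

On the other hand, if Lucifer chooses $p > 1$ at position $i$, Dante will go back to state 1 if he chooses
$1, \ldots, p-1$ and win immediately if he chooses $p$.

So each time Dante chooses $1, \ldots, p-1$, which he does with probability $\sum_{j=1}^{p-1} x_{i,j}^t$, he has to get back to
position $i$ from position 1. Since Lucifer uses strategy $y^t$, Dante needs to choose 1 in all positions from 1
to $i-1$, which he has a probability of $\prod_{j=1}^{i-1} x_{j,1}^t$ of doing. Each time he is at position $i$ he has a probability
of $x_{i,p}^t$ to win.

His probability for winning is therefore

$$x_{i,p}^t \sum_{l=0}^{\infty} \Big(\Big(\sum_{j=1}^{p-1} x_{i,j}^t\Big)\Big(\prod_{j=1}^{i-1} x_{j,1}^t\Big)\Big)^l = \frac{x_{i,p}^t}{1 - \big(\sum_{j=1}^{p-1} x_{i,j}^t\big)\big(\prod_{j=1}^{i-1} x_{j,1}^t\big)}$$

which, by Lemma 23, is equal to

$$\frac{x_{i,1}^t \big(1 - v_1^{t-1}/v_{i+1}^{t-1}\big)^{p-1}}{1 - \big(\sum_{j=1}^{p-1} x_{i,j}^t\big)\big(\prod_{j=1}^{i-1} x_{j,1}^t\big)}$$

which, by Lemma 20, is equal to

$$\frac{x_{i,1}^t \big(1 - \prod_{j=1}^{i} x_{j,1}^{t-1}\big)^{p-1}}{1 - \big(\sum_{j=1}^{p-1} x_{i,j}^t\big)\big(\prod_{j=1}^{i-1} x_{j,1}^t\big)}.$$

We will show using induction in $p > 1$, that Lucifer is better off if he always chooses 1, than if he always
chooses $p$. That is:

$$\forall p > 1 : x_{i,1}^t < \frac{x_{i,1}^t \big(1 - \prod_{j=1}^{i} x_{j,1}^{t-1}\big)^{p-1}}{1 - \big(\sum_{j=1}^{p-1} x_{i,j}^t\big)\big(\prod_{j=1}^{i-1} x_{j,1}^t\big)}. \qquad (6)$$

For $p = 2$, we may argue as follows. By Lemma 25 we have that $\prod_{j=1}^{i} x_{j,1}^{t-1} < \prod_{j=1}^{i} x_{j,1}^t$. Since $x_{i,1}^t > 0$,
this implies

$$x_{i,1}^t < \frac{x_{i,1}^t \big(1 - \prod_{j=1}^{i} x_{j,1}^{t-1}\big)}{1 - \prod_{j=1}^{i} x_{j,1}^t} \qquad (7)$$

which is the statement we wanted to prove.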

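For $p > 2$, we argue as follows. The right hand side of (6) is

$$\frac{x_{i,1}^t \big(1 - \prod_{j=1}^{i} x_{j,1}^{t-1}\big)^{p-1}}{1 - \big(\sum_{j=1}^{p-1} x_{i,j}^t\big)\big(\prod_{j=1}^{i-1} x_{j,1}^t\big)}.$$

Applying Lemma 23 (and, as before, Lemma 20) this may be rewritten as

$$\frac{x_{i,1}^t \big(1 - \prod_{j=1}^{i} x_{j,1}^{t-1}\big)^{p-1}}{1 - \Big(\sum_{l=1}^{p-2} x_{i,1}^t \big(1 - \prod_{j=1}^{i} x_{j,1}^{t-1}\big)^{l-1}\Big)\Big(\prod_{j=1}^{i-1} x_{j,1}^t\Big) - \big(1 - \prod_{j=1}^{i} x_{j,1}^{t-1}\big)^{p-2}\Big(\prod_{j=1}^{i} x_{j,1}^t\Big)}. \qquad (8)$$

To bound (8), we use the induction hypothesis:

$$x_{i,1}^t < \frac{x_{i,1}^t \big(1 - \prod_{j=1}^{i} x_{j,1}^{t-1}\big)^{p-2}}{1 - \Big(\sum_{l=1}^{p-2} x_{i,1}^t \big(1 - \prod_{j=1}^{i} x_{j,1}^{t-1}\big)^{l-1}\Big)\Big(\prod_{j=1}^{i-1} x_{j,1}^t\Big)}.$$

We note that the induction hypothesis implies

$$1 - \Big(\sum_{l=1}^{p-2} x_{i,1}^t \big(1 - \prod_{j=1}^{i} x_{j,1}^{t-1}\big)^{l-1}\Big)\Big(\prod_{j=1}^{i-1} x_{j,1}^t\Big) < \Big(1 - \prod_{j=1}^{i} x_{j,1}^{t-1}\Big)^{p-2}$$

and conclude that the expression (8) is at least:

$$\frac{x_{i,1}^t \big(1 - \prod_{j=1}^{i} x_{j,1}^{t-1}\big)}{1 - \prod_{j=1}^{i} x_{j,1}^t}$$

which, by equation (7), is strictly greater than $x_{i,1}^t$, as desired. □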
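Inequality (6) can be explored numerically on small instances. The sketch below is not the paper's algorithm: it generates the strategies by the recurrence from Lemmas 20 and 22 under the assumption that Lucifer replied "always 1" in all earlier iterations, and then evaluates Dante's winning probability in $P(i,m)$ against each alternative choice $p$ at position $i$. The names `row_strategy`, `next_strategy` and `win_probability` are illustrative.

```python
from fractions import Fraction


def row_strategy(z, m):
    """Row player's strategy in B(z): p_j = p_1 (1 - z)^(j-1), normalised."""
    z = Fraction(z)
    if z == 0:
        return [Fraction(1, m)] * m
    p1 = z / (1 - (1 - z) ** m)
    return [p1 * (1 - z) ** (j - 1) for j in range(1, m + 1)]


def next_strategy(x, m):
    """One update of Dante's strategy, assuming Lucifer's reply is 'always 1'."""
    N = len(x)
    v = [Fraction(1)] * (N + 2)          # v[N + 1] = 1 is GOAL
    for i in range(N, 0, -1):
        v[i] = x[i - 1][0] * v[i + 1]    # guess 1 at position i, then win from i+1
    return [row_strategy(v[1] / v[i + 1], m) for i in range(1, N + 1)]


def win_probability(x, i, p):
    """Dante's winning probability in P(i, m) from position i when Lucifer
    hides p at position i and 1 at positions 1..i-1."""
    if p == 1:
        return x[i - 1][0]
    undershoot = sum(x[i - 1][:p - 1])   # guesses 1..p-1 send Dante back to 1
    climb_back = Fraction(1)
    for j in range(i - 1):
        climb_back *= x[j][0]            # guess 1 at positions 1..i-1
    return x[i - 1][p - 1] / (1 - undershoot * climb_back)
```

Starting from the uniform strategy and applying `next_strategy` a few times, `win_probability(x, i, p)` stays strictly above `win_probability(x, i, 1)` for every $p > 1$ in the small cases tried, consistent with (6).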

Lemma 28 Let $x^t$ be the behavior strategies computed when applying strategy iteration to $P(N,m)$. Let $\bar{x}^t$
be the behavior strategies computed when applying strategy iteration to $P(1,m)$. Then, for all $t$, $x_1^t = \bar{x}_1^t$.

Proof We show this by induction in $t$. For $t = 1$, both $x_{1,j}^1 = \frac{1}{m}$ and $\bar{x}_{1,j}^1 = \frac{1}{m}$. For $t > 1$, Lemma 27 states
that $y^t$ chooses 1 in every position. By Lemma 22 we have that $x_1^t = p^z$, where $z = v_1^{t-1}/v_2^{t-1} = x_{1,1}^{t-1}$, where the
last equation is by Lemma 20.

On the other hand, applying strategy iteration to $P(1,m)$, yielding strategies $\bar{x}^t$, we similarly get $\bar{x}_1^t = p^z$,
where $z = \bar{x}_{1,1}^{t-1}$. Since $x_{1,1}^{t-1} = \bar{x}_{1,1}^{t-1}$ by induction, we are done. □

Lemma 29 Applying strategy iteration to $P(1,m)$ yields valuations $v^t = \tilde{v}^t$, i.e. strategy iteration computes
the same valuations as value iteration.

Proof We show this by induction in $t$. By Lemma 27, $y^{t-1}$ is the strategy that chooses 1. Thus, Dante
wins if and only if he chooses 1 in the first round and we have $v_1^{t-1} = x_{1,1}^{t-1}$. On the other hand, by Lemma
22 we have that $x_{1,1}^t = p_1^z$ where $z = x_{1,1}^{t-1}$. Thus, $v_1^t = p_1^z$ where $z = v_1^{t-1}$. Inspecting the value iteration
algorithm we find that we also have that $\tilde{v}_1^t = p_1^z$ where $z = \tilde{v}_1^{t-1}$, and since we can see by inspection that
we also have $v^1 = \tilde{v}^1$, we are done. □
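The statements of Lemma 28 and Lemma 29 can be illustrated by iterating the recurrences exactly. The sketch below is a simplification rather than a full implementation of strategy iteration: it assumes, as Lemma 27 asserts, that Lucifer's best reply always hides 1, so that $v_i^t = \prod_{j \ge i} x_{j,1}^t$ and $x_i^{t+1} = p^z$ with $z = v_1^t/v_{i+1}^t$. For value iteration on $P(1,m)$ it assumes the start value 0 at position 1 and uses that the value of $B(z)$ equals $p_1^z$. All function names are illustrative.

```python
from fractions import Fraction


def row_strategy(z, m):
    """Row player's strategy in B(z): p_j = p_1 (1 - z)^(j-1), normalised."""
    z = Fraction(z)
    if z == 0:
        return [Fraction(1, m)] * m
    p1 = z / (1 - (1 - z) ** m)
    return [p1 * (1 - z) ** (j - 1) for j in range(1, m + 1)]


def strategy_iteration(N, m, iterations):
    """Strategies x^1..x^T on P(N, m), assuming Lucifer always replies 'hide 1'."""
    x = [[Fraction(1, m)] * m for _ in range(N)]   # x^1 is uniform everywhere
    history = [x]
    for _ in range(iterations - 1):
        v = [Fraction(1)] * (N + 2)                # v[N + 1] = 1 is GOAL
        for i in range(N, 0, -1):
            v[i] = x[i - 1][0] * v[i + 1]
        x = [row_strategy(v[1] / v[i + 1], m) for i in range(1, N + 1)]
        history.append(x)
    return history


def value_iteration_one_position(m, iterations):
    """Valuations of position 1 in P(1, m): next value is val(B(z)) = p_1^z."""
    values = [Fraction(0)]
    for _ in range(iterations):
        values.append(row_strategy(values[-1], m)[0])
    return values
```

Comparing `strategy_iteration(3, m, T)[t][0]` with `strategy_iteration(1, m, T)[t][0]` shows identical position-1 strategies, and the first entry of the latter matches the value-iteration sequence shifted by one index.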

Note that Lemma 29 together with Corollary 7 yields our previously stated claim that strategy iteration
may need exponential time to achieve non-trivial approximations for a one-position game.

Finally, we give the proof of Lemma 11 (stating that when applying strategy iteration to $P(N,m)$, the patience
of the strategy $x^t$ computed in iteration $t$ is at most $emt$).

Proof of Lemma 11 By Lemma 14, Lemma 17 and Lemma 24, we have that the smallest behavior probability in $x^t$ is $x_{1,m}^t$, i.e., the probability of playing
$m$ in the start position where Dante still has to guess correctly $N$ times to win.

Then, by Lemma 28, to estimate this probability, we can consider $P(1,m)$ instead of $P(N,m)$. In fact
we shall consider the valuations $v^t$ computed when applying strategy iteration to $P(1,m)$. By Lemma 29
the values computed are the same as those $\tilde{v}^t$ computed by value iteration on $P(1,m)$. So, by Theorem 6
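and Lemma m we have that $\tilde{v}_1^t \le 1 - (1 - \frac{1}{m})(\frac{1}{mt})^{1/(m-1)}$. That is, $1 - \tilde{v}_1^t \ge (1 - \frac{1}{m})(\frac{1}{mt})^{1/(m-1)}$. Now,
Lemma 17 tells us that $x_{1,m}^t \ge \big((1 - \frac{1}{m})(\frac{1}{mt})^{1/(m-1)}\big)^{m-1} = (1 - \frac{1}{m})^{m-1}\frac{1}{mt} \ge \frac{1}{emt}$ and we are done. □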
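The bound of Lemma 11 can be observed numerically on the one-position game. The sketch below iterates $z \mapsto p_1^z$ in floating point, using the closed form $p_1^z = z/(1-(1-z)^m)$ derived from the geometric structure in Lemma 17 (an assumption of this sketch, as is the premise from Lemma 27 that Lucifer always hides 1), and records the patience $1/x_{1,m}^t$. Function names are illustrative.

```python
import math


def last_action_probability(z, m):
    """p_m^z for B(z): p_1 (1 - z)^(m-1) with p_1 = z / (1 - (1 - z)^m)."""
    if z == 0.0:
        return 1.0 / m
    p1 = z / (1.0 - (1.0 - z) ** m)
    return p1 * (1.0 - z) ** (m - 1)


def patience_sequence(m, iterations):
    """Patience 1 / x^t_{1,m} for t = 1..iterations on P(1, m)."""
    z = 0.0                      # x^1 is uniform, i.e. the strategy p^0
    result = []
    for _ in range(iterations):
        result.append(1.0 / last_action_probability(z, m))
        # The next z is v_1^t = x^t_{1,1} = p_1^z.
        z = 1.0 / m if z == 0.0 else z / (1.0 - (1.0 - z) ** m)
    return result
```

For $m = 2$ the patience grows as $2, 3, 4, \ldots$, well below $e \cdot m \cdot t$, and for the other small $m$ tried the bound also holds with room to spare.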